Abstract 

Background: Plant breeders use an increasingly diverse range of data types to identify lines with desirable 
characteristics suitable to be taken forward in plant breeding programmes. There are a number of key 
morphological and physiological traits, such as disease resistance and yield, that need to be maintained and 
improved upon if a commercial variety is to be successful. Computational tools that can integrate and visualize 
these data with pedigree structure will enable breeders to make better decisions on the lines that are 
used in crossings to meet the demands for both increased yield/production and adaptation to climate change. 

Results: We have used a large and unique set of experimental barley (H. vulgare) data to develop a prototype 
pedigree visualization system. We then used this prototype to perform a subjective user evaluation with domain 
experts to guide and direct the development of an interactive pedigree visualization tool called Helium. 

Conclusions: We show that Helium allows users to easily integrate a number of data types along with large plant 
pedigrees to offer an integrated environment in which they can explore pedigree data. We have also verified that 
users were happy with the abstract representation of pedigrees that we have used in our visualization tool. 



Background 

The effects of climate change and the need to ensure food 
security in a world with an increasing population are 
becoming ever more pertinent [1-3]. The exploitation of 
pedigrees in plant breeding allows breeders to target 
specific plant crosses to maximise the potential of 
achieving desirable, agriculturally important 
characteristics such as yield, drought/water tolerance and 
disease resistance, which will be required if new varieties 
are to be bred to cope with increased demand in a changing 
environment. 

The ability to predict and visualize the inheritance of 
alleles that facilitate resistance to pathogens or any other 
commercially important characteristic is crucially 
important to experimental plant genetics and commercial 
plant breeding programmes. Derivation of the inheritance 
of such traits by traditional molecular techniques is 
expensive and time consuming, even with recent 
developments in high-throughput technologies. This is 
especially true in industrial settings where, due to time 
constraints relating to growing seasons, many thousands 
of plant lines may need to be screened quickly, 
efficiently and economically every year. 

Due to their complexity, there is a cognitive limitation 
in conceptualising large pedigree structures. 

While it may not be achievable or indeed necessary to 
understand every mating relationship between related 
individuals, an overall picture can lead to insight into 
the data and any patterns it may contain. This can also 
aid in the identification of problems (both biological and 
data handling issues) within datasets when coupled with 
expert domain knowledge. 

This is particularly important when looking at pedi- 
gree data as the context in which each line sits may hold 
additional and important information (such as the inher- 
itance of particular genome regions from ancestral var- 
ieties). It is because of this that a combination of visual 
and statistical analytics would allow geneticists and com- 
mercial breeders to gain a deeper understanding of the 
transmission of genetic elements within a pedigree based 
framework but there is currently a lack of suitable tools 
to analyse these data types. 

Software tools that improve the speed at which this 
analysis can be carried out, and increase users' ability 
to conceptualise large pedigrees, would bring both time 
and cost gains to breeding companies. 

Using a unique and extensive barley dataset covering 
pedigree, genotypic and phenotypic data for UK elite 
germplasm which has been through the UK National 
List Testing procedures [4], we discuss the challenges of 
visualizing the transmission of alleles encoding traits and 
characteristics of agricultural importance in a pedigree- 
based framework. We then describe the subsequent 
development of a pedigree visualization tool that was 
implemented in close collaboration with domain experts. 

Plant pedigrees 

A pedigree (Figure 1) is a representation of how 
genetically discrete individuals are related to one 
another, usually (but not exclusively) in time. It is 
therefore a representation of the genetic relationship 
between individual plant lines, their parents and progeny 
(predecessors and successors). Pedigrees are often used in 
human contexts to show the transmission of alleles 
responsible for genetic conditions of medical importance. 
In plants they are used, along with environmental data, as 
a framework on which statistical analysis can be used to 
determine factors such as mode of inheritance (Identity by 
Descent, IBD, and Identity by Association, IBA). 
Additionally, they are often used to check for potential 
genotyping errors, since these errors, by the very nature 
of Mendelian inheritance, are constrained by the pedigree 
structure in which they exist [5]. The accurate 
representation of pedigrees is therefore becoming 
increasingly important in plant breeding and genetics. 
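
The way in which pedigree structure constrains genotype calls can be illustrated with a minimal sketch. It assumes biallelic calls written as two-letter strings (for example "AG"), complete data for every trio, and no mutation; the function names and the example calls are hypothetical and are not taken from any tool described here.

```python
def mendelian_consistent(child: str, parent1: str, parent2: str) -> bool:
    """True if the child's two alleles can be drawn one from each parent."""
    a, b = child
    return (a in parent1 and b in parent2) or (a in parent2 and b in parent1)


def flag_errors(calls, trios):
    """Return (marker, child) pairs whose calls violate Mendelian inheritance.

    calls: line name -> {marker name -> two-letter allele call}
    trios: iterable of (child, parent1, parent2) line names
    """
    errors = []
    for child, p1, p2 in trios:
        for marker, call in calls[child].items():
            if not mendelian_consistent(call, calls[p1][marker], calls[p2][marker]):
                errors.append((marker, child))
    return errors


# Hypothetical example: marker m2 cannot be AA if both parents are GG.
calls = {
    "P1": {"m1": "AA", "m2": "GG"},
    "P2": {"m1": "GG", "m2": "GG"},
    "C": {"m1": "AG", "m2": "AA"},
}
print(flag_errors(calls, [("C", "P1", "P2")]))
```

A flagged call may indicate a genotyping error, but it may equally indicate an error in the recorded parentage, which is why such checks benefit from expert review.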

While there are defined standard nomenclatures for 
human pedigrees [7], there is no single formal system for 
plant pedigrees, although there are moves towards 
defining standards. There are valid biological reasons for 
this, including the hermaphrodite nature of most plant 
species, the complexity of mating designs possible in 
plant genetics and, finally, the absence of any overseeing 
coordinating organisation. 

While plant and animal breeding share routine breeding 
techniques such as standard crossing and back-crossing, 
pedigrees used in plant breeding display some subtle but 
important differences. These often involve key shorthand 
conventions that are unique to plant mating designs, 
leading to complex text-based records which can be 
difficult to read (see 'Pedigree formats' subsection). 
Firstly, the named entities in plant pedigrees may, but do 
not always, represent a population of genetically identical 
individuals rather than a single plant. While it is 
relatively simple to grow many plants from seed, 
potentially many decades after production, in humans and 
animals this is understandably not the norm. The 
generation of these genetically identical (homozygous) 
varieties is possible through doubled haploidy, 
inbreeding, or crossing of pairs of inbred lines to achieve 
what is termed an F1 hybrid. Successive inbreeding by 
self-pollination of these F1 generation plants leads to 
individual plants that are close to homozygous across all 
alleles. The exploitation of homozygous lines in crop 
species such as barley is a powerful tool in genetic 
analysis, removing some of the genetic complexities 
associated with species (such as humans) where there is a 
high level of heterozygosity. 

Pedigree formats 

a. A/B//C//D 

b. ((A B) ^^C) ^^D 

c. [A X [(B X C) ' D] X E] * [F X A] X C 

Figure 1 Traditional barley pedigree. Common representation of a barley pedigree showing Elite cultivars [6]. These representations cover only 
a limited number of lines, are commonly seen in humans and animals and are therefore easy to read. 

Genetic transmission 

Pedigree formats can be complex, with no standard 
nomenclature: 

a. The Purdy notation system [8] was put forward by Purdy 
as a common format for representing small grain cereal 
pedigrees. Forward slashes '/' are used to delimit lines. 
In this case A is crossed with B, the progeny is then 
crossed with C, and the resulting progeny is crossed 
with D. 

b. Lamacraft and Finlay notation [9] was put forward as a 
format that could be more easily parsed by computers. 
The example here is the same as in the Purdy notation 
above. 

c. A typical pedigree that can be found in old records, 
where a mixture of notations is used. These mixed 
notation systems are common and most breeders will use 
shorthand that is unique to them. These records are 
sometimes difficult to read and would benefit from being 
represented in a more user-friendly way. 
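
As an illustration of how such strings can be broken down, the sketch below atomises a slash-delimited pedigree using the simple left-to-right reading given for example a. It is an assumption-laden simplification: it does not implement the full Purdy rules for cross order, and the function name and the "crossN" labels given to unnamed intermediate progeny are hypothetical.

```python
import re


def atomise_purdy(pedigree: str):
    """Fold a slash-delimited pedigree left to right into
    (child, parent1, parent2) records."""
    names = [n.strip() for n in re.split(r"/+", pedigree) if n.strip()]
    if len(names) < 2:
        raise ValueError("a cross needs at least two lines")
    records = []
    current = names[0]
    for i, partner in enumerate(names[1:], start=1):
        child = f"cross{i}"  # hypothetical label for an unnamed intermediate
        records.append((child, current, partner))
        current = child
    return records


print(atomise_purdy("A/B//C//D"))
```

Each record is a single parent/child definition, which is the form needed to rebuild the pedigree as a graph.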

Data sets 

There are a number of different data types used in this 
work. Our primary data set is composed of a large barley 
pedigree data set for 803 UK Elite cultivars as well as 
Single Nucleotide Polymorphism (SNP) genotypic data 
for 750 of these lines across 4,769 genetic markers. In 
addition, phenotypic data for these lines for 33 
Distinctiveness, Uniformity and Stability (DUS) characters 
[10] across multiple years and sites were used (1980 - 
present, which equates to 601,148 data points). Datasets 
covering UK wheat (Triticum spp.) and Asian rice (Oryza 
sativa) were also used in this work, although these are 
more limited in size. Data are stored in the Germinate 2 
database system [11]. Connecting to Germinate was an 
important design decision because it gives users access to 
all of the background information on plant lines that we 
had available. 

Pedigree definitions 

The nucleus of pedigree data is a series of parent/child 
relationships defined as encoded strings (see 'Pedigree 
formats' subsection) [8,9]. Data were atomised into simple 
parent/child definitions which were used to dynamically 
reconstruct the pedigree. In addition, there may also be 
information identifying whether the parent was male or 
female and the type of genetic cross performed. A feature 
unique to plant breeding is that a single plant can act as 
both the male and the female parent in the same cross. 
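
A minimal sketch of reconstructing a pedigree from atomised records is given below. The record layout (child, female parent, male parent), the function names and the line names are assumptions made for illustration; they do not describe the Germinate schema or the Helium code.

```python
from collections import defaultdict


def build_pedigree(records):
    """Build parent and child lookups from (child, female, male) records."""
    parents = {}
    children = defaultdict(set)
    for child, female, male in records:
        parents[child] = (female, male)
        # A set collapses a selfed parent (female == male) to a single edge.
        for parent in {female, male}:
            children[parent].add(child)
    return parents, dict(children)


def selfed_lines(parents):
    """Lines whose female and male parent are the same plant."""
    return sorted(c for c, (f, m) in parents.items() if f == m)


# Hypothetical records: F2 is produced by selfing F1.
records = [("F1", "A", "B"), ("F2", "F1", "F1")]
parents, children = build_pedigree(records)
print(children)
print(selfed_lines(parents))
```

Keeping both lookups makes it equally cheap to walk up to ancestors or down to descendants.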

Complications may arise both from older pedigree data, 
which is error prone and may be difficult to verify 
without expert guidance, and from the re-use of names to 
describe varieties, which creates false relationship 
joins. It is not uncommon for a breeder's favourite name 
to be used multiple times until a line is adequately 
different, and has sufficient performance, to be accepted 
for wider distribution into the UK recommended list 
programme. 



Genotypic data 

The genotypic data set for our study is based on a set of 
SNP markers which are mapped to known chromosome 
positions in the barley genome. Each plant line within 
the test set has been genotyped for a set of 7,000 of 
these markers. 

A given plant variety will have an allele call for each of 
a series of loci represented as a pair of nucleotide bases 
e.g. AA, GG (which are homozygous) or AG (which are 
heterozygous), for a locus. Due to the inbred nature of 
our barley germplasm there are low levels (less than 
0.5%) of residual heterozygosity present. 
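
The residual heterozygosity of a line can be estimated directly from its allele calls. The sketch below assumes calls are stored as two-letter strings keyed by marker name and that anything else (for example "--") denotes a missing call; both the encoding and the function name are illustrative assumptions.

```python
def heterozygosity(calls):
    """Fraction of scored loci whose two bases differ.

    calls: marker name -> two-letter allele call such as 'AA' or 'AG'.
    """
    scored = [c for c in calls.values() if len(c) == 2 and c.isalpha()]
    if not scored:
        return 0.0
    het = sum(1 for c in scored if c[0] != c[1])
    return het / len(scored)


# Hypothetical line with one heterozygous and one missing call.
print(heterozygosity({"m1": "AA", "m2": "GG", "m3": "AG", "m4": "--"}))
```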

Phenotypic data 

The phenotypic data in our study have been collected 
either in field experiments or by molecular testing. 
Though many of the agriculturally important traits are 
controlled by many genes of small effect (quantitative 
traits), for simplicity we concentrated on traits under 
simple genetic control. Examples of such traits include 
DUS characteristics, which are used in the varietal 
registration and seed certification process, and allele 
data on disease resistance genes such as Mlo and Mla. 

Previous work 

The ability to visualise data is imperative in modern ex- 
perimental plant genetics, with volumes of data being rou- 
tinely produced far exceeding the ability for humans to 
digest and identify underlying phenomena. Until now, 
pedigree visualization, with few exceptions [12,13] has pri- 
marily been focussed on work carried out in the human 
genetics domain. Because plant breeding programmes in- 
volve phenomena not normally seen in human popula- 
tions, such as routine inbreeding, there are additional 
visualization challenges that need to be overcome. There 
are often large numbers of plant lines involved in any 
pedigree, many more so than in an average human pedi- 
gree due to factors such as generation time/time to sexual 
maturity which is far lower in most plant species than that 
of their mammalian counterparts. This section will look at 
the various visualization techniques used to represent 
pedigree based data and highlight the problems and 
strengths that these techniques exhibit. 

Table-based approaches 

Table-based visualization tools such as Flapjack [14] ad- 
dress some of the problems associated with visualizing 
large datasets and are optimized for efficient sorting 
and querying of genotypic and phenotypic data, but cur- 
rently lack the ability to display data on a pedigree- 
based scaffold. 

Other tools such as PedStats [15] offer statistical 
validation of users' pedigree data without visualization 
of the actual pedigree structure. It is, however, 
difficult if not impossible to conceptualize pedigree 
structure for complex data sets without some visual 
representation. 

Matrix-based visualizations of pedigrees use the 
intersection of the x and y axes to define relationships. 
Matrix-based visualizations have advantages over 
node-link or graph-centred layout approaches, including 
the ability to create compact graph representations and 
the ability to remove edge overlapping. However, tests 
generating matrix visualizations using our pedigree data 
have shown that the data density is so low that the 
resulting representations are not particularly insightful. 
The ability to easily track flow and identify paths is 
also removed. 
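
The low density can be seen with simple arithmetic. Assuming that each line records at most two parents, a square parent/child matrix over n lines has at most 2n filled cells out of n x n, so its density is bounded by 2/n. The sketch below computes both the observed density for a parent lookup and this upper bound; the function names are hypothetical and the bound is our illustration, not a figure reported from the tests described above.

```python
def matrix_density(parents):
    """Fraction of filled cells in a square parent/child matrix.

    parents: child name -> (parent1, parent2)
    """
    lines = set(parents)
    for pair in parents.values():
        lines.update(pair)
    n = len(lines)
    # A selfed line fills one cell rather than two.
    filled = sum(len(set(pair)) for pair in parents.values())
    return filled / (n * n) if n else 0.0


def max_density(n):
    """Upper bound on density when every line has at most two parents."""
    return 2 / n


print(matrix_density({"C": ("A", "B")}))
print(f"{max_density(803):.4%}")
```

For a pedigree of 803 lines the bound is about 0.25%, which means that almost all of the matrix area would be empty.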

Tools such as GeneaQuilts [16] offer a new visualization 
technique suitable for use with thousands of individuals 
but offer limited scope for the addition of complex 
genotypic and phenotypic data. Discussions with our users 
also showed that they found such representations difficult 
to interpret. 

Finally, tools such as VIPER [17] offer novel pedigree 
visualization and genotypic error checking capabilities. 
VIPER is essentially a stack of nested table 
representations of generations where rows represent sires, 
dams or children and columns represent individuals, which 
can span multiple columns where they are parents. VIPER's 
primary use is in the identification of genotyping problems 
in farmed animals and it would be unsuitable for 
visualizing the complex crossing relationships that exist 
between crops, where selfing is not uncommon. VIPER 
requires separate male and female parents, which is the 
norm in any application handling animal or human data, 
but is not always the case in plant breeding. 

Graph-based 

Unlike trees, graphs allow for the precise modelling of the 
complexity of a plant breeding programme. Techniques 
such as node-link diagrams have long been used as a way 
of representing graph-based data and recent work has 
examined how effectively the node-link model represents 
graph data when compared to matrix-based visualizations 
[18]. Work carried out by Purchase [19,20] and Bennett 
[21] also indicated that while graph layout played an 
important part in a user's understanding, it was not the 
major focus; this focus perhaps being the use of other 
aesthetics relating to node colour and shape. 

Most of the current tools have been developed for hu- 
man pedigrees where consanguineous mating events are 
negligible. This is not the case in plant and animal 
breeding which cannot be properly modelled using tools 
that use node-link or tree hierarchies such as Pedfiddler 
and Madeline [22]. 

Cranefoot [23] reports the use of mathematical graph 
structures to deal with between-relative mating but the 
approach is limited in its current form in the amount of 
information that can be attached to a node. Finally, 
HaploPainter [24] allows the drawing of genetic 
haplotypes, but suffers from being restricted in the 
number of individuals it is able to display. 

A commonly used two-dimensional pedigree visua- 
lization tool is Peditree [13] which offers a tree-based view 
of data in a pedigree but this is not suited to our require- 
ments as plant pedigrees are not trees (inbreeding and the 
use of older lines in more modern crosses prevents us 
from treating them as such). Other tools such as the 
Pedigree Visualizer by Wong [25] offer new layout algo- 
rithms. Wong suggests introducing duplicate "alias" lines 
in representations with multiple matings from the same 
individuals, phenomena that are commonplace in plant 
data. PyPedal [26] not only offers rudimentary graph dra- 
wing tools, restricted to changing node shape to represent 
male and females, but also error checking algorithms to 
try and identify potential pedigree errors where appro- 
priate genotypic data exists. 

Visualization techniques such as sunbursts [27], which 
are space-filling versions of a node-link diagram, have 
the advantage that a node's position in a hierarchy is 
maintained. Additionally, Fan Charts [28] and H-trees [29] 
have also been described as a means for recounting human 
genealogy; these techniques, however, assume no 
inbreeding (they are trees and not graphs) and thus rule 
themselves out for use with plant pedigrees. 

While the main problems with these additional tech- 
niques are that they are not appropriate for observing a 
pedigree in its entirety (indeed the complexity of the data 
may rule many of them out), they may be useful when 
trying to visualize a sub-section of data such as a sub- 
pedigree for specific lines. 

Layout algorithms 

Plant pedigrees often form what we describe as a 
pedigree net, whereby there is structure to the graph but 
it is not as simple as the traditional top-down pedigree 
representation that is seen in humans and, to a lesser 
extent, in farmed animals (Figure 2). 

Figure 2 Difference between plant (A) and animal (B) pedigree structure and shape. The plant pedigree in A shows what we have called a 
pedigree net structure and is more random in shape than a typical human or mammalian pedigree as shown in B, where a typical pig pedigree 
shows a structured pyramidal topology. Sires are coloured blue and dams red with generations running top to bottom. Both diagrams were 
created using our initial paper-based prototype tool; see "Paper-based pedigree visualization" for more information. 

This abstract representation does include a time 
component in the form of generations, but due to the 
viability of seed, and the existence of varieties and 
landraces that may be many hundreds of years old, there is 
the potential to use these older varieties in modern 
crosses. This situation leads to nodes at the top of the 
graph having edges connecting to nodes at the bottom - 
this is not common in animals and would be extremely 
unlikely in humans. The existence of a time component 
means that the use of a layout algorithm that preserves 
topology (top-down generations) is nonetheless important, 
as most (but not all) crossing will be between newer 
varieties. Because of this, layout methodologies such as 
force-directed algorithms (Figure 3B) would not offer the 
ability for us to arrange our pedigree based on time. 
Force-directed layouts are therefore not well suited to 
our requirements: the lack of a visually identifiable 
pedigree structure is strikingly apparent. 
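
The effect of re-using old varieties on a generation-based layout can be sketched with a simple longest-path layering, in which founders sit on layer 0 and every other line sits one layer below its deepest parent. This is an illustrative stand-in only: it is not the rank assignment used by Graphviz dot, it assumes the pedigree contains no cycles, and the function and line names are hypothetical.

```python
def assign_layers(parents):
    """Longest-path layering of a pedigree.

    parents: child name -> tuple of parent names. Assumes no cycles.
    """
    layers = {}

    def layer(line):
        if line not in layers:
            ps = parents.get(line, ())
            layers[line] = 1 + max(map(layer, ps)) if ps else 0
        return layers[line]

    for child, ps in parents.items():
        layer(child)
        for p in ps:
            layer(p)
    return layers


def edge_spans(parents, layers):
    """Number of layers crossed by each parent -> child edge."""
    return {(p, c): layers[c] - layers[p]
            for c, ps in parents.items() for p in set(ps)}


# Hypothetical pedigree: the old variety "Old" is re-used three generations on.
parents = {"G1": ("Old", "B"), "G2": ("G1", "C"), "G3": ("G2", "Old")}
layers = assign_layers(parents)
print(layers)
print(edge_spans(parents, layers)[("Old", "G3")])
```

Edges with a span greater than one are the long top-to-bottom connections described above; a layered layout keeps them visible as such, whereas a force-directed layout has no notion of layer at all.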

The problem of very large pedigrees in humans has 
been identified and solutions proposed in tools such as 
PViN [30] which looks at windows on large datasets but 
only offers pedigree drawing with no scope for addition 
of other information onto the visualization. In addition, 
its traditional human family tree output is not the most 
efficient use of space for plant pedigrees, which form a 
denser net due to modes of reproduction that are not seen 
in humans or animals (Figure 2A). 

Although there are problems associated with 2D node-link 
layouts, such as a lack of horizontal space and problems 
with crossing of edges [31], they are still well suited to 
displaying data of this type. 3D tools also have their 
problems, including visual occlusion and that they tend 
to visualise high-level features and not specifics, so 
while some trends are easy to spot, the actual detail is 
hidden from the user. From this point of view they are 
limited in use for our purposes and offer no advantages 
over their 2D counterparts. Notable examples of such 
tools are Walrus [32] and Celestial3D [31] but their 
success lies in alternate problem domains. 



Discussion 

It is clear that these techniques and tools contain many 
features that are useful, but none meet the exact 
requirements (including data abstraction) of our problem 
to be able to overlay genotypic and phenotypic data onto a 
complex pedigree structure. 

Figure 3 Layout types. Barley pedigree for Quench which is a commonly used high yielding spring malting variety from Syngenta. It is a cross 
between Sebastian and Drum. A. shows the standard Sugiyama-style layered layout using dot and B. the same data using the fdp force-directed 
layout. Both layout tools are from GraphViz. The traditional pedigree top to bottom topology is lost using force-directed layout algorithms. 


There is a need for the development of tools that are 
tailored for the unique needs of plant breeding with the 
ability to explore pedigree structure, and paint additional 
genotypic and phenotypic data on top, to allow breeders 
to make informed decisions and visualize the way in 
which alleles for agriculturally important traits are trans- 
mitted through previous and subsequent generations. 
Such tools do not currently exist. 

Through the examination of methodologies to display 
pedigree data we suggest that the best method to visualize 
plant pedigree data is a layered layout (Sugiyama-style) 
based approach (Figures 2A and 3A). Not only does this 
allow us to accurately map the exact specifics of how 
breeding programmes run (including inbreeding) but it also 
provides a well-established framework onto which a 
visualization can be built. The use of graphs as our data 
structure means that standard graph-traversal algorithms 
can be used to bring greater functionality to our pedigree 
structure, both in locating ancestors and descendants and 
as a logical framework which can be used to look for 
problems with underlying datasets. The layered layout 
representation also brings a coherent structure to sparse 
relationships, and generations and topological layout are 
clearer than in matrix-style layouts. This is not the case 
with animal (Figure 2B) and human pedigrees, whose 
top-down fan type shape is not well suited to a layered 
layout as they quickly become very large, consuming large 
volumes of horizontal space [17]. 
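
Locating ancestors and descendants with a standard traversal can be sketched as a breadth-first search over parent and child lookups. The function name is hypothetical; the Quench, Sebastian and Drum relationship is taken from Figure 3, while the lines "X" and "Y" are invented for the example.

```python
from collections import deque


def reachable(start, neighbours):
    """All lines reachable from start by repeatedly following neighbours.

    neighbours: line name -> iterable of adjacent line names
    (parents for an ancestor search, children for a descendant search).
    """
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# "X" and "Y" are hypothetical lines added for illustration.
parents = {"Quench": ("Sebastian", "Drum"), "X": ("Quench", "Y")}
children = {"Sebastian": {"Quench"}, "Drum": {"Quench"},
            "Quench": {"X"}, "Y": {"X"}}

print(sorted(reachable("X", parents)))          # ancestors of X
print(sorted(reachable("Sebastian", children))) # descendants of Sebastian
```

Because visited lines are recorded, the search terminates even where inbreeding gives several routes to the same ancestor.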

Tools that allow exploration of data to try and bring a 
greater understanding of complex relationships between 
individuals should bring greater insight into how plant 
breeding programmes operate at the genetic level and 
how to bring maximum potential benefit from them. The 
ability to detect patterns and associations (or even 
anomalies) within these datasets will lead to increased 
depth of domain knowledge for plant breeders and 
geneticists. Examples include the identification of 
problems with inheritance of alleles, the identification 
of lines from which additional information would allow 
inference of data on large parts of the pedigree, the 
detection of simple typos and errors, and the search for 
lines which are similar to unknown lines. 

Methods 

Paper-based visualization 

We wanted to test if our use of a directed acyclic graph 
(DAG) based data structure and layered layout approach 
would work with our barley pedigree data and would be 
accepted by our users. In order to do this a paper-based 
layout was implemented, overlaying basic character data on 
to the graph nodes by means of colour, and sizing nodes 
based on the number of times they had been used in crosses 
in our data. In this prototype (which was implemented 
using Perl and the Graphviz dot library) our pedigree was 
modelled as graph nodes to represent plant lines and 
edges to show mating/parentage. While GraphViz has 
been used before in pedigree drawing [33], examples 
focus on a small number of individuals. 
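
The kind of input passed to dot can be sketched as follows. The original prototype was written in Perl; this Python version is a hypothetical reconstruction that only emits DOT text, with node width scaled by the number of times a line is used as a parent and fill colour taken from a category lookup. The scaling constants, colour names and function name are assumptions.

```python
from collections import Counter


def to_dot(parents, category, colours):
    """Render a pedigree as DOT text for a layered (top-to-bottom) layout.

    parents:  child name -> (parent1, parent2)
    category: line name -> category label (e.g. ecotype)
    colours:  category label -> Graphviz colour name
    """
    uses = Counter(p for pair in parents.values() for p in pair)
    lines = set(parents) | set(uses)
    out = ["digraph pedigree {",
           "  rankdir=TB;",
           "  node [shape=circle, style=filled];"]
    for line in sorted(lines):
        width = 0.5 + 0.25 * uses[line]  # assumed scaling, in inches
        fill = colours.get(category.get(line), "white")
        out.append(f'  "{line}" [width={width:.2f}, fillcolor="{fill}"];')
    for child in sorted(parents):
        for parent in sorted(set(parents[child])):
            out.append(f'  "{parent}" -> "{child}";')
    out.append("}")
    return "\n".join(out)


print(to_dot({"Quench": ("Sebastian", "Drum")},
             {"Quench": "spring"},
             {"spring": "green", "winter": "blue"}))
```

The resulting text can be saved to a file and rendered with the dot command-line tool, or with fdp to obtain a force-directed layout of the same graph for comparison.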

While initially this prototype was run by users as a 
command-line computer program which generated, from input 
files, an image that could be viewed on their computer 
monitors, it was decided that printing this static 
representation (2.5 m x 1 m, see Figure 4) would allow 
domain experts to better interact with the visualization. 
We overlaid, by means of colouring nodes, the 
winter/spring ecotype category on this dataset as (along 
with the 2-row/6-row ecotype) it is the most commonly 
used physiological means of differentiating barley 
varieties, and one that all of our test users were 
familiar with. This tool was also implemented as a 
web-service which allowed us to include static (but 
dynamically generated) pedigree representations within 
our internal barley information portal. 

Figure 4 Helium static prototype. Users interacting with the large 
static prototype implementation of our pedigree visualization. In this 
example 2 and 6 row ecotypes are coloured green and blue respectively 
and varieties (nodes) are sized based on their contribution to the overall 
pedigree in terms of the number of lines that are derived from them. 
This shows which lines are most commonly used in barley breeding in 
the UK. The 2/6 row division is one of the important characteristics used 
to differentiate barley worldwide. 2 row barley is primarily of spring type 
and used in brewing and distilling while 6 row varieties are used in feed 
due to decreased quality characteristics for the brewing and distilling 
industry. Consent was obtained from both participants for publication 
of their images. 

Feedback on paper-based prototype 

Through observation and talking to twelve geneticists 
and plant breeders while they interacted with our wall- 
mounted visualization it was clear that there were a 
number of issues associated with this implementation. 
Firstly, it was almost impossible to trace edges between 
nodes when the data was dense (even at a large output 
size) so we found ourselves falling back on examining 
text based records to confirm lineage. Secondly, it is 
incredibly challenging to quickly locate specific plant 
lines with this density of data. Commonly used lines are 
immediately identifiable due to the use of size to repre- 
sent the number of uses in breeding crosses but these 
are not always what users are most interested in. Users 
used these larger nodes as reference points, almost as if 
they were notable points on a map [34,35] and attempts 
at using slightly different layouts or orientations were 
not well received. 

It was also clear that users were beginning to quickly 
spot pedigree problems. These problems related to the 
parentage of lines and in some cases the assignation of 
ecotype. These types of errors would be extremely diffi- 
cult for a user without extensive experience to pick up 
on and this has not only shown that it is an effective 
technique for visualization but also an effective way of 
identifying errors with underlying datasets. 

Users liked this representation of large pedigrees. Not 
only is it visually attractive, but geneticists were using 
it to identify problems with the underlying pedigree and 
phenotypic data in a way that is more interactive, social, 
and tactile compared to the examination of records. 

When presented with our results, plant breeders told us 
that it gave them an overview of their data that was not 
currently available to them; indeed these representations 
uncovered interesting information relating to the relative 
frequency of use of particular "key" lines in the UK Elite 
Barley germplasm that would have been difficult to see 
from textual records in the format seen in the 'Pedigree 
formats' subsection. Such records have not been collated 
like this before. Missing data was also easily spotted, 
thus allowing us to update our underlying datasets. 

Problems do however exist, especially in the inability 
to search for particular plant varieties and tracing of 
edges to establish lineage. In order to try and address 
these, it was quickly realised that we would need to 
move towards the development of a more interactive 
software tool - Helium - named after the balloon type 
appearance of our static prototype. 

The Helium prototype 

Taking the feedback obtained from our initial informal 
user testing, an interactive detail and overview [36] 
prototype pedigree visualization system using Java and 
the yFiles library from yWorks [37] (Figure 5) was 
implemented. This prototype maintained the same visual 
metaphors (nodes and edges) to describe pedigree 
structure but added features to allow users to search 
and explore the data and link in plant passport, 
phenotype and background data from our Germinate 
database. One reason for the design decision to use 
Germinate was that it ensures that researchers working on 
our barley data will all be using the same data from the 
same source. 

While our paper prototype included a single static 
image, it was clear that when users were viewing our 
visualization on computer monitors there would be a 
limitation on the number of nodes that could be 
displayed while still retaining legibility of line names. 
To address this our main visualization panel (Figure 5A) 
can be zoomed and panned to allow users to explore 
data. An overview panel was added (Figure 5B) which 
would allow users to track where they were in the main 
visualization window and give a high-level overview of 
the pedigree structure. The overview would act as a 
common reference point for our users that would not 
change as the main visualization window was 
manipulated. Feedback from our paper implementation also 
showed that users would want to get as much background 
information as possible on lines and so a detail 
panel was added (Figure 5C) which displays passport and 
general background information. Data from Germinate is 
displayed in the detail panel and is pulled on demand 
based on a user's selection in the main visualization 
window. 

Figure 5 First Java implementation of Helium showing node sizing and colouring and basic connection to our in-house Germinate 
barley database. Users can pan and zoom around the display and perform basic searches as well as overlay simple nominal and ordinal data 
which is loaded from the database backend. This version of Helium was used in subsequent user testing to steer the development of our more 
advanced system. The colouring in this figure shows predecessors (ancestors) in green, the line of interest in blue and descendants (successors) 
in purple. Figure 5A shows the main visualization window. Figure 5B shows the overview panel and Figure 5C the details panel for this interface. 

Germinate also includes phenotypic data of both nominal 
and ordinal types which were colour-coded in Helium 
using ColorBrewer2 palettes [38,39]. Hue was used to 
differentiate nominal data and saturation to distinguish 
ordinal data classes for phenotypes and genetic similarity 
metrics within our visualization [40] (Figure 6). 
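
The distinction between the two encodings can be sketched with the Python standard library. This is not how Helium assigns colours (it uses ColorBrewer2 palettes); the sketch simply varies hue for nominal categories and saturation for ordered classes, and the function names and default values are assumptions.

```python
import colorsys


def _hex(rgb):
    """Convert an (r, g, b) tuple of 0-1 floats to a #rrggbb string."""
    return "#" + "".join(f"{round(ch * 255):02x}" for ch in rgb)


def nominal_colours(categories, saturation=0.6, value=0.9):
    """One evenly spaced hue per unordered category."""
    cats = sorted(set(categories))
    return {c: _hex(colorsys.hsv_to_rgb(i / len(cats), saturation, value))
            for i, c in enumerate(cats)}


def ordinal_colours(levels, hue=0.6, value=0.9):
    """One hue, with saturation increasing along the ordered levels."""
    n = len(levels)
    return {lvl: _hex(colorsys.hsv_to_rgb(hue, (i + 1) / n, value))
            for i, lvl in enumerate(levels)}


print(nominal_colours(["winter", "spring"]))
print(ordinal_colours([1, 2, 3]))
```

Holding hue constant for ordinal data keeps the classes visibly related while still ordering them from pale to strong.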

User testing of the Helium prototype 

User testing is an important aspect of the development 
lifecycle of visualization [41-43]. Both Munzner and Lam 
lay out the requirements for testing, specifically 
relating to visualization studies, covering both the 
planning of user studies and reflection on them. A 
subjective evaluation was performed to establish user 
perception/acceptance and understanding of the 
visualization methods within Helium. This was to 
establish empirically if users were happy with 
representing data as graphs, moving away from the 
traditional family-tree type methods, and whether the use 
of graphs fits in with a user's perception of pedigree 
structure and function. Could our users perform basic 
pedigree operations such as accurately tracking back 
through generations and find the information they require 
using our visualization? We also wanted to ensure that 
users were able to interact well with our methods, which 
allow much greater data density and increased plant line 
density. 

The testing data was obtained through a questionnaire 
and comment-based feedback on how intuitive our users 
found the main features of the prototype to be. We also 
asked how our tool could be improved in terms of general 
usage or new features. This is important because, although 
initial user requirements had been gathered, we expected 
users to come up with new ideas on features or utility 
that would benefit their research once they actually 
started using our software. 

This feedback allowed us to improve our interface and 
visualization to help increase our users' understanding of 
the system and underlying biological concepts. 

User testing methodology 

A pre-screening questionnaire, a set of user tasks, and a 
follow-up questionnaire centred on the predefined tasks 
that users would be asked to perform were developed. The 
initial questions were designed to gain an overall 
impression of the length of experience the user has had in 
this field, and to classify their job title. There are two 
distinct groups of potential users: bioinformaticians/ 
computational biologists and plant geneticists 
(experimental)/breeders (applied). User tasks were 
developed using our initial application requirements and 
were designed to force the users conducting the test to 
explore our experimental test datasets. The follow-up 
questionnaire was split into two sections: the first took 
the form of attitude-scale questions on the user's opinion 
of the software and visualization in terms of their use of 
it (assuming comparison to their current method of viewing 
these data types), and the second consisted of subjective 
open-ended questions to obtain additional information that 
could be used to drive development of this software tool. 

Figure 6 The Helium interface. Our visualization tool has been split into a number of distinct areas which are shown here. The use of 
coordinated multiple views is a common design choice in visualization interfaces [36]. The use of a high-level overview window (Figure 6A) 
and main view (Figure 6B) helps users maintain orientation and provides a filtering mechanism for a detailed local view (Figure 6C). This allows 
easier tracing of lineage whilst maintaining greater context with the main visualization window. A) Data selection and overview panel. This 
contains a high level overview of the entire pedigree and does not change (apart from node colouring and resizing) during a user's interaction 
with the application. This is intended to be used as a reference point for large pedigrees. This panel also contains a set of tabbed panes that let a 
user select phenotypic or genotypic data to overlay as well as further information from our Germinate database. B) The main pedigree visualization 
panel is where most manipulation happens. Users can pan and zoom across this window. The nodes are hotspots: they can be selected using the mouse, 
and hovering displays additional information about both nodes and edges. C) The local view panel contains our local view of a selected line and 
offers tools to allow a user to define how many generations (forwards and backwards) they want to view. The local view removes much of the visual 
clutter associated with the main pedigree visualization and allows edges to be more easily tracked. D) The detail panel contains search functionality 
and overview statistics. This example shows colouring for the ordinal data type "Anthocyanin Colour" which is a DUS character. 

The questions assume that a comparison is being made 
to other methods that test subjects are using, or have 
used, to obtain the same information. We can use the 
responses to signify whether our visualization and user 
interface bring significant improvements in the visual 
representation and understanding of pedigree structure. 
Throughout the study, notes were taken and screen and 
audio capture was used to further examine a user's 
interaction with the interface and to aid in recounting 
the tests. Each test was scheduled to take around 
45 minutes: 

• 5 minutes - pre-questionnaire 

• 5 minutes - familiarisation 

• 25 minutes - test 

• 10 minutes - post-test questionnaire 

After completion of the main interaction study our users 
completed an attitude scale where they indicated their 
preference on a 5 point scale from "Very Difficult" (1) 
to "Very Easy" (5) relating to a number of statements 
about their use of this software. 

The questionnaire asked users to detail features or 
concepts that they found to be confusing, those they 
found to be clear, and features that they felt would add 
value to their research. Finally, users were asked to 
provide general comments about their use of our software; 
these would be used to allow us to tweak and fine-tune 
the Helium interface to aid our users with their 
research. 

Results 

General background profiling 

The 16 expert users that undertook this study break 
down as follows: 5 bioinformaticians, 10 plant geneticists 
and breeders and 1 statistician. Of these users, 94% 
were educated to PhD/MSc level and the average length 
of time working in their areas was 17 years. Experience 
ranged from a minimum of 1 year to a maximum of 36 years, 
with a median of 13.5 years. 

While all users were familiar with pedigree data, 69% 
used it on a day-to-day basis as part of their research 
and 38% regularly used alternative tools. 

It should be noted that verbal feedback established that 
the researchers who were using pedigree data were using 
paper records and spreadsheets, rather than a specific 
pedigree tool, to curate and maintain the pedigree data 
used in their work. 

Main user interaction study 

Users were asked to answer eight questions using our 
pedigree interface. Each question was assigned an overall 
category; Table 1 shows the question classification along 
with the percentage of correct and incorrect responses. 

Our user testing uncovered some interesting problems 
with our visualization. For example, the category 
"Identifying Children" from Table 1 asked our participants 
to identify the progeny of a specific barley variety. In 
44% of completed questionnaires this question was answered 
incorrectly. However, in the related "Tracing Lineage" 
category from Table 2, users thought that it was easy to 
trace lineage by following graph edges. Our test users 
were continually missing the same progeny (one of three) 
of the line: the one whose complete edge was not 
immediately visible, and disappeared off the right-hand 
side of their computer display. When we talked to a 
selection of users after the test had been carried out 
and asked them to answer the same question, they did so 
without error (although by then they were understandably 
suspicious of the reasons behind the request). 

Table 1 Interaction study correct answers 

Question classification          Correct (%)   Incorrect (%) 
Unexplained concepts             50            50 
Simple grandparent tracking      93.75         6.25 
Identifying children             56.25         43.75 
Complex grandparent tracking     50            50 
Phenotype classes                100           0 
Great-grandparent tracking       37.50         62.50 
Finding additional information   93.75         6.25 
Colour coding perception         56.25         43.75 

Table 2 Post study questions (Scaled/Likert 5 very easy, 1 very difficult) 

Question classification    Median (M)   Mode 
Colour coding              3            3 
Phenotype classes          4            4 
Maintaining position       4            4 
Clarity of relationships   4            5 
Tracing lineage            4            4 
Understanding data         4            4 
Background information     4.5          5 
Ease of use                4            4 
Finding parents            4.5          5 
Navigation                 5            5 
Children                   5            5 
Finding lines              5            5 
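
As a reading aid for the percentages in Table 1, the short sketch below converts a reported percentage back into a number of participants. It assumes that all 16 participants answered every question, which the text does not state explicitly; the function name is hypothetical.

```python
from fractions import Fraction

PARTICIPANTS = 16  # number of expert users reported for the study


def count_from_percent(percent, total=PARTICIPANTS):
    """Convert a reported percentage into a whole number of participants."""
    # Fractions avoid floating-point rounding when checking for whole numbers.
    count = Fraction(str(percent)) * total / 100
    if count.denominator != 1:
        raise ValueError(f"{percent}% of {total} is not a whole number of people")
    return int(count)
```

Under that assumption, 43.75% incorrect for "Identifying children" corresponds to 7 of 16 participants, and 62.50% incorrect for "Great-grandparent tracking" to 10 of 16.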

Post-study questionnaires (attitudinal and open ended) 

After carrying out our main interaction study the users 
were asked to fill in a series of questions that asked 
them to compare Helium to the pedigree tools, or methods 
of handling pedigree data, that they are familiar with, 
and that gathered feedback on what they found easy and 
difficult to understand or perform with Helium. These 
results are presented in Table 2. 

Test results discussion 

The most common responses have been detailed by divi- 
ding them into features users liked and disliked. These 
were obtained from feedback gained in our post-study 
questionnaire. 

Features users liked 

1. Layout was easy to understand and made scientific sense to users. 
2. It was easy to follow edges. 
3. Searching for plant lines was simple. 
4. Bringing together additional data sources was extremely helpful. 

Features users disliked or found confusing 

1. It was sometimes difficult to differentiate colour coding. 
2. Long edges are disorientating. 
3. There was no auto-selection of lines when performing a search. 
4. Clearer explanations of ordinal data categories were needed. 

Our test users liked the speed at which they could find 
data, the ease of tracing lineage through complex graphs 
(although our testing has shown that there were issues 
with this) and the intuitive layout of our visualization 
and supporting application. Our testing did highlight 
some issues, mainly around the colour gradients used for 
ordinal data, which are ineffective and difficult for our 
users to distinguish when there are more than eight 
phenotype classes. 

Development of Helium 

Feedback from the user evaluation allowed us to address 
issues that our users had with our prototype in order to 
develop a more refined and useful visualization applica- 
tion. We needed to work to increase understanding of 
concepts, representations and visual metaphors that our 
users found difficult to understand during testing. 

The main feedback gained from our initial prototype 
was that it was difficult to track lineage with overlapping 
edges and that the ability to interactively overlay, query 
and retrieve various data types from our internal barley 
database would be important. Our users also had pro- 
blems with identifying phenotype classes. Other issues 
were with the complexity of the graphs and problems 
identifying children. 

Any subsequent development would need to address 
these points if it was going to offer a usable and effective 
tool for users. 

The interface was re-designed to show 4 main areas: a) 
the overview panel and data selection panel, b) the main 
pedigree visualization panel, c) the local view panel and 
finally d) the details panel. These are described below. 

Overview and data selection panel 

In addition to the overview of the pedigree, this panel 
(Figure 6A) includes selection mechanisms for choosing 
ordinal and nominal categorical phenotypic classes, as 
well as tools for visualizing genetic similarity data 
(Figure 7). Users can use the overview to navigate to a 
particular region within the main visualization window 
if required. 

Interactive sliders allow users, in the case of similarity 
data, to set a percentage similarity value and in real time 
highlight lines which match the search criteria (Figure 7). 
In this way it is possible to see lines which should not be 
closely related appearing on the peripheries of our vi- 
sualization as the slider is moved, which may indicate 
problems with pedigree definition or genotyping. Histo- 
grams have also been included, where appropriate, to 
show data distribution which can be an aid in the iden- 
tification of problem markers. While the number of 
markers that have this problem is limited, it is nonethe- 
less important to address. 
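
The slider behaviour described above amounts to a threshold filter over the similarity values for the selected base line. The following is a minimal sketch of that idea only; the function name and the dictionary input are assumptions and do not reflect how Helium or Germinate store or query the data.

```python
def lines_above_cutoff(similarity, cutoff):
    """Return (line, score) pairs at or above the cut-off, most similar first.

    `similarity` maps a line name to its percentage similarity to the base line.
    """
    hits = [(name, score) for name, score in similarity.items() if score >= cutoff]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Re-running the filter each time the slider moves gives the set of nodes to highlight, and the sorted result corresponds to the kind of sortable table shown in Figure 7.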

Other features included in this panel are the ability to 
select more than one phenotype and then recolour nodes 
based on the merged phenotype classes. While originally 
it had been intended to show each phenotype as a 
different section on a node, speaking to users showed 
that they would be interested in finding exact 
combinations, and so it was decided to go with a single 
node colour to reduce clutter and keep the visualization 
clearer. There are, however, problems, as the number of 
colours that may have to be used can be around 20. Such a 
high number has been shown to be ineffective at 
differentiating between classes [40,44,45]. 
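
A small sketch of how merged phenotype classes lead to many colours is given below. Each line is assigned the combination of its classes for the selected phenotypes, and every distinct combination needs its own colour. The data layout and function names are hypothetical.

```python
def merged_classes(phenotypes, selected):
    """Map each line to the tuple of its classes for the selected phenotypes."""
    return {line: tuple(values[p] for p in selected)
            for line, values in phenotypes.items()}


def colour_slots(merged):
    """Assign one colour slot to each distinct combination that actually occurs."""
    combos = sorted(set(merged.values()))
    return {combo: slot for slot, combo in enumerate(combos)}
```

In the worst case the number of combinations is the product of the class counts of the selected phenotypes, so merging even two phenotypes with a handful of classes each can exceed the number of colours that users can reliably distinguish.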

Main visualization panel 

The main visualization window (Figure 6B) was modified 
in a number of ways from our prototype. Firstly, we 
have moved away from bundled orthogonal edge routing 
(Figure 5) to make the tracing of lineage easier. 
Slightly modified colour palettes were used to account for 
the situation where there are more than eight categorical 
classes. The new colour palette is intended to help with 
the problem our testing showed, where adjacent classes were 
too similar in colour for users to accurately distinguish. 
In Table 1 the incorrect responses to "Identifying 
Children" were high at 43.75%. In order to address this, 
visual prompts were added when hovering over a node; these 
display the number of ingoing and outgoing edges from a 
node and the names of the line's progeny (Figure 6B). This 
makes the number of progeny immediately obvious, which 
should help prevent some of the problems seen in testing. 
When a user selects a node, the edges connecting nodes of 
interest are made more prominent both by removing edges 
which are not associated with the selected node, its 
ancestors, or its successors, and by darkening the edges 
which are left. Hovering over a graph edge will show the 
names of the two nodes that it connects; in this way, with 
long edges, it is easier to track their origin and 
destination while using the main visualization window. 
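
The hover prompts and edge filtering described above can be sketched on a simple directed graph of (parent, child) edges. This is an illustrative reading of the behaviour, not Helium's Java implementation: the class, its method names, and the rule used to decide which edges count as "associated" with the selection are assumptions.

```python
class Pedigree:
    """Directed pedigree graph built from (parent, child) edges."""

    def __init__(self, edges):
        self.edges = set(edges)
        self.parents, self.children = {}, {}
        for parent, child in self.edges:
            self.children.setdefault(parent, set()).add(child)
            self.parents.setdefault(child, set()).add(parent)

    def hover_info(self, line):
        """Edge counts and progeny names, as shown when hovering over a node."""
        progeny = sorted(self.children.get(line, ()))
        return {"incoming": len(self.parents.get(line, ())),
                "outgoing": len(progeny),
                "progeny": progeny}

    def _reachable(self, line, neighbours):
        """All nodes reachable from `line` by repeatedly following `neighbours`."""
        seen, stack = set(), [line]
        while stack:
            for nxt in neighbours.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def highlighted_edges(self, line):
        """Edges on a path into the line from its ancestors or out to its successors."""
        up = self._reachable(line, self.parents) | {line}
        down = self._reachable(line, self.children) | {line}
        return {(p, c) for p, c in self.edges if c in up or p in down}
```

Reporting the progeny names directly means a child whose edge runs off-screen is still listed, which is the failure seen in the "Identifying children" task.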

Local view panel 

Our testing also showed that while users reported they 
found it easy to identify lineage there were some issues. 
These problems could be addressed by including a "local" 
implementation of our graph showing only the line of 
interest and its lineage (Figure 6C). This is shown 
when a user selects a node in our visualization. This view 
was implemented below the main visualization window. 
The local view can be panned and zoomed in the same 
way as the main visualization window. Within the local 
view the user has control of how many generations, 
forwards and backwards, they want to go. This addresses the 
problems highlighted in Table 1, where 50% and 62.5% of 
users incorrectly answered the "Complex Grandparent 
Tracking" and "Great-Grandparent Tracking" questions 
respectively. With appropriate selection of generation 
level, grandparents, or indeed any other generation, 
are now immediately obvious in the simplified pedigree. 
Additionally, the ability to lay out the graph using a 
number of edge routing algorithms was added. Any changes 
made to the main pedigree visualization are propagated to 
the local view. While the local view includes another copy 
of a portion of the main visualization, it will increase 
the accuracy of tracing lineage, as unnecessary lines are 
removed and edges between nodes shortened, thus addressing 
the problems highlighted in testing and reducing the need 
to "chase edges". 

Figure 7 Helium showing genetic similarity data. Genetic similarity data is stored in our Germinate database (all-by-all pairwise comparisons) 
which is displayed to the user by selecting a base node then showing similarity of lines in relation to this node. When the user selects another 
node the new data is retrieved and displayed. A) A slider allows users to select a cut-off for the similarity values (45-100%) and the results are 
shown in a sortable table below. The histogram shows the data distribution for the selected line which is one of many indications on the quality 
of the data. Selecting a row from the table jumps to that line, then updates the visualization accordingly. B) Coloured nodes to show similarity to 
the base line, and node sizing to show the number of times a line has been used in subsequent crosses (the larger the node the greater the number 
of times it has been used as a parent). C) When a line is selected from the main display only edges joining the selected line with its predecessors or 
successors are shown which reduces display clutter. 

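The generation controls of the local view can be thought of as a depth-limited traversal from the selected line. The sketch below is one way to express that, assuming the pedigree is available as (parent, child) pairs; the function name and parameters are hypothetical.

```python
def local_view(edges, line, back=1, forward=1):
    """Nodes within `back` generations of ancestors and `forward` of descendants."""
    parents, children = {}, {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
        children.setdefault(parent, set()).add(child)

    def within(neighbours, depth):
        # Expand one generation per step, never revisiting a node.
        seen, frontier = set(), {line}
        for _ in range(depth):
            frontier = {n for f in frontier for n in neighbours.get(f, ())} - seen
            seen |= frontier
        return seen

    return {line} | within(parents, back) | within(children, forward)
```

Setting back=2 brings grandparents into view and back=3 great-grandparents, the two cases users found hardest in Table 1.
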
Detail panel 

The details panel (Figure 6D) shows information about 
either the currently selected phenotype(s) or information 
from Germinate about specific selected plant lines. The 
example in Figure 6D shows the distribution of the DUS 
character "Anthocyanin Colour". The histogram has been 
coloured in the same way as the phenotype classes in the 
main visualization window. 



The details panel also houses search functionality 
which allows searching for lines with the usual search 
features such as wildcard matching, and an option which 
we have called the "follow me" mode which jumps to a 
search hit, selects it and subsequently updates the detail 
panel and main visualization window. 
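
Wildcard matching of line names can be sketched with Python's standard fnmatch module. The pattern syntax shown (`*` for any run of characters, `?` for a single character) and the case-insensitive comparison are assumptions; the text does not specify which wildcard syntax Helium accepts.

```python
from fnmatch import fnmatchcase


def search_lines(names, pattern):
    """Return line names matching a shell-style wildcard pattern, ignoring case."""
    pattern = pattern.lower()
    return sorted(name for name in names if fnmatchcase(name.lower(), pattern))
```

A "follow me" style mode would then take the first hit from this list, select it, and refresh the dependent panels.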

During discussions with users it was also apparent that 
the ability to export line names would be a useful feature, 
allowing scientists to make up lists for sending samples 
off for genotyping based on phenotypic or genotypic 
characteristics. The ability to export lists has therefore 
been implemented: users can select nodes and then add 
them to an export list which can be saved to a text file. 

Finally, a user history panel has been included which 
records the lines and phenotypes that have been selected 
over a session so that if required, users can go back and 
see what they had been doing previously. This is impor- 
tant as with large quantities of data it is easy for users to 
forget what they have been doing over time. 
Examples of the layout and features offered by Helium 
can be seen in Additional file 1. 



Discussion 

An interesting challenge arising from the development of 
Helium is trying to quantify whether this tool actually 
improves a user's decision making, and whether the software 
influences users into making more informed decisions about 
their data. One of the aims of our testing was to assure 
ourselves that the decisions that had been made around the 
design of the tool were good foundations on which our 
target users can build knowledge, and to that end we seem 
to have made an impact. While we have used standard 
approaches in the visualization tool we have developed, we 
have applied them directly to a specific domain and 
tailored our application appropriately. 

While users requested as much information as possible 
in the interface, we need to be careful that we only 
include necessary information and do not turn Helium 
into a tool that presents so much unnecessary information 
that it becomes unusable or difficult to comprehend; we 
need to avoid overloading users with information. While 
this may seem like a problem that scientists would love 
to have, it could have detrimental effects: do we need to 
actually present raw data or are overviews enough? Would a 
user's understanding be affected by what we present them 
with? 



Users have told us that overlaying data onto the 
pedigree structure has, in some ways, more impact than 
showing the division of data in a bar chart or as a table. 
Having areas of colour directly in view brings insight 
into both the location of clusters of similar data and the 
visual impact of nodes changing from one colour to another; 
it brings the representation of data to life in logical 
and understandable ways. 

Examples of the sorts of things that users wanted to 
be able to do with our tool include a) given genotype 
data for a line identify possible matches and b) basic 
error checking based on genotypic or phenotypic data. 
These are detailed below. 

Given genotype data for a line identify possible matches 

Helium will take a string of genotypic data and identify 
possible matches from data held in our Germinate database, 
then display the possible hits on the pedigree display. 
This is useful as it is not uncommon for errors to be 
introduced through mislabelling or handling errors in the 
lab when genetic material is sent for genotyping. Using the 
pedigree framework may give users other ways of trying to 
identify what unknown or problem lines are, or may point 
geneticists and breeders in the right direction as to 
their source. If, for example, two similar lines are 
mislabelled, we may be able to deduce the correct naming 
through examination of pedigree records. Further 
investigation would be required to identify the correct 
source of this germplasm, as there is a possibility that 
either it or the genotyping is wrong. These types of error 
are not uncommon. 

Figure 8 Pedigree visualization static prototype. This was one of our first attempts at visualizing our entire barley pedigree. The colours of 
nodes were used to distinguish between the winter/spring ecotype (red shows spring barley, blue shows winter barley and the cream coloured 
nodes are lines that are in both winter and spring pedigrees - qualitative data type) and node size to show the number of times the line has 
been used in crosses that have given rise to progeny that have been successful in National List trialling in the UK - quantitative data type. To the 
best of our knowledge this is the first time that a pedigree involving this number of commercially released lines has been brought together in 
one place, and it sparked interest with commercial plant breeders when they were presented with it. 
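
One simple way to picture matching a genotype string against stored lines is to score the fraction of markers with identical calls. The sketch below makes several assumptions that are not taken from the text: genotypes are equal-length strings with one character per marker, "-" denotes a missing call, and a fixed threshold decides what counts as a possible match. It is not the matching method used by Helium or Germinate.

```python
def match_fraction(query, reference):
    """Fraction of markers scored in both genotypes that carry the same call."""
    if len(query) != len(reference):
        raise ValueError("genotype strings must cover the same markers")
    compared = [(q, r) for q, r in zip(query, reference) if q != "-" and r != "-"]
    if not compared:
        return 0.0
    return sum(q == r for q, r in compared) / len(compared)


def possible_matches(query, database, threshold=0.9):
    """Lines whose match fraction reaches the threshold, best match first."""
    scored = ((name, match_fraction(query, genotype))
              for name, genotype in database.items())
    return sorted((hit for hit in scored if hit[1] >= threshold),
                  key=lambda hit: hit[1], reverse=True)
```

The returned line names are what would be highlighted on the pedigree display, where their position relative to known relatives can help decide whether a label or a genotype is at fault.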

Basic error checking based on genotypic or 
phenotypic data 

We can use the interface to look for potential errors with 
a given line. We know that the alleles of a line must come 
from one or other of its parents, so we can use this in 
basic error checking. For example, if both parents of a 
line have been genotyped as carrying allele A at a given 
locus but the progeny has allele B, then we know there is 
a problem. Additionally, we can expand this type of search 
to look at multiple loci within a dataset. Taking this a 
step further, we can use genotypic data to highlight 
potential parents of a line and, if one parent is known, 
make a guess at possible candidates for the second parent. 
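
The parent-progeny consistency check described above can be sketched as follows. Genotypes are assumed here to be dictionaries mapping a locus name to a string of allele characters (for example "A" or "AB"); that representation, and the decision to skip loci not scored in both parents, are illustrative assumptions rather than a description of Helium's implementation.

```python
def inconsistent_loci(parent1, parent2, progeny):
    """Loci where the progeny carries an allele that neither parent carries."""
    errors = []
    for locus, alleles in progeny.items():
        # Only check loci scored in both parents, to avoid false alarms
        # caused by missing data.
        if locus not in parent1 or locus not in parent2:
            continue
        allowed = set(parent1[locus]) | set(parent2[locus])
        if not set(alleles) <= allowed:
            errors.append(locus)
    return sorted(errors)
```

A non-empty result flags the line for closer inspection; it does not say whether the pedigree record or the genotyping is at fault.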

Conclusions 

We have shown through the development of Helium 
that visualization of our example pedigrees along with 
genotypic and phenotypic data provides users with new 
insights into crop breeding. 

The representation of our unique barley test dataset 
shows that the pedigree structure takes the form of what 
we have coined a pedigree net. Our visualization has 
shown that there are three main classes of plant lines 
seen when viewed in Helium, which we have named: a) 
principal lines, which are commonly used to generate 
new cultivars due to their possession of desirable 
characteristics; b) flanking cultivars, which are brought 
in to increase the genetic diversity of subsequent lines 
and are less commonly used in crosses; and c) terminal 
varieties, which are released but have had little 
subsequent use. 

One of the more hard-hitting measures of success of 
our first paper-based prototype came from the presentation 
of data to a meeting of UK plant breeders. While the 
pedigree data that we demonstrated was available to all in 
the room as written records (like those in the 'Pedigree 
formats' subsection), the representation that we showed 
(Figure 8) had a major impact through the provision of 
new insights into how closely related the germplasm was. 
When a pedigree is written as a text string it is difficult 
to construct the bigger picture, but when displayed in our 
tool, the relationships between competing breeders' lines 
were much more striking. While this was privately known to 
the individual breeders, having it presented to them when 
they were all in the same room was very enlightening. 
This not only highlights the value of visualization but 
also shows that we have implemented a visualization tool 
with real-world impact. 

While Helium has been tailored to specific data types 
(genotypic/similarity, nominal and ordinal phenotypic 
data and pedigree definitions) it is intended to be a 
framework on to which, over time, additional data types 
can be added, and we are working with plant scientists 
and breeders worldwide to develop the Helium platform 
further. 

For more information on Helium please visit our web- 
site http://ics.hutton.ac.uk/helium. 

Additional file 



Additional file 1: Helium features movie. This movie shows the main 
layout and features of the Helium system along with basic interface 
interaction. 